Abstract


Background: We have compared 38 isolates of the SARS-CoV complete genome. The main goal 
was twofold: first, to analyze and compare nucleotide sequences and to identify positions of single 
nucleotide polymorphisms (SNPs), insertions and deletions, and second, to group the isolates according to 
sequence similarity, eventually pointing to the phylogeny of SARS-CoV isolates. The comparison is 
based on genome polymorphism such as insertions or deletions and the number and positions of SNPs.


Results: The nucleotide structure of all 38 isolates is presented. Based on insertions and deletions 
and dissimilarity due to SNPs, the dataset of all the isolates has been qualitatively classified into 
three groups each having their own subgroups. These are the A-group with "regular" isolates (no 
insertions / deletions except for 5' and 3' ends), the B-group of isolates with "long insertions", and 
the C-group of isolates with "many individual" insertions and deletions. The isolate with the 
smallest average number of SNPs, compared to other isolates, has been identified (TWH). The 
density distribution of SNPs, insertions and deletions for each group or subgroup, as well as 
cumulatively for all the isolates is also presented, along with the gene map for TWH.


Since individual SNPs may have occurred at random, positions corresponding to multiple SNPs 
(occurring in two or more isolates) are identified and presented. This result revises some previous 
results of a similar type. Amino acid changes caused by multiple SNPs are also identified (for the 
annotated sequences, as well as presupposed amino acid changes for non-annotated ones). Exact 
SNP positions for the isolates in each group or subgroup are presented. Finally, a phylogenetic tree 
for the SARS-CoV isolates has been produced using the CLUSTALW program, showing high 
compatibility with the former qualitative classification.


Conclusions: The comparative study of SARS-CoV isolates provides essential information on 
genome polymorphism and indicates strain differences and the evolution of variants. It may help with the 
development of effective treatment.
Background

Severe Acute Respiratory Syndrome (SARS) is a new infectious
disease reported first in the autumn of 2002 and diagnosed
for the first time in March 2003 [1]. It is still a serious
threat to human health, and the SARS coronavirus (SARS-CoV)
has been associated with the pathogenesis of SARS according
to Koch's postulates [2].


Significant research effort has gone into investigating the
SARS-CoV genome sequence, with the aim of establishing its
origin and evolution and, eventually, of helping to prevent
or cure the disease it causes. Although the task is a hard
one, it opens up the opportunity, amongst others, for
comparative investigation of different SARS-CoV isolates
aimed at identifying properties of genome regions that
express different levels of sequence polymorphism [3-8].


The genome of SARS-CoV consists of a single positive-sense
RNA strand approximately 30 kb in length, containing about
10 open reading frames (ORFs) and about 10 intergenic
regions (IGRs). The first two overlapping ORFs at the 5' end
encompass two-thirds of the genome, while the rest of the
ORFs at the 3' end account for the remaining third.


Table 1: List of the SARS-CoV complete genome isolates investigated. Included are isolates’ labels, IDs, accession numbers, length in 
nucleotides, dates of revisions considered and countries and sources of isolates. 


Label  ID           Accession No.  Length  Revision date  Country/Source
1.     TWH          AP006557.1     29727   02-AUG-2003    Taiwan: patient #01
       TWC2         AY362698.1             13-AUG-2003    Taiwan: Hoping Hospital
2.     TWC3         AY362699.1     29727   13-AUG-2003    Taiwan: Hoping Hospital
3.     TWK          AP006559.1     29727   02-AUG-2003    Taiwan: patient #06
4.     TWS          AP006560.1     29727   02-AUG-2003    Taiwan: patient #04
5.     TWY          AP006561.1     29727   02-AUG-2003    Taiwan: patient #02
6.     Urbani       AY278741.1     29727   12-AUG-2003    USA: Atlanta
7.     TWJ          AP006558.1     29725   02-AUG-2003    Taiwan: patient #043
8.     TWC          AY321118.1     29725   26-JUN-2003    Taiwan, first fatal case
9.     WHU          AY394850.2     29728   12-JAN-2004    China: Wuhan
10.    TW1          AY291451.1     29729   14-MAY-2003    Taiwan
11.    Frankfurt 1  AY291315.1     29727   11-JUN-2003    Germany: Frankfurt
12.    FRA          AY310120.1     29740   12-DEC-2003    Germany: patient from Frankfurt
13.    HKU-39849    AY278491.2     29742   29-AUG-2003    China: Hong Kong
14.    Tor2         AY274119.3     29751   16-MAY-2003    Canada: Toronto, patient #2
                    NC_004718.3            06-FEB-2004    Canada: Toronto, patient #2
15.    HSR 1        AY323977.2     29751   15-OCT-2003    Italy
16.    CUHK-Su10    AY282752.2     29736   17-NOV-2003    China: Hong Kong
17.    CUHK-W1      AY278554.2     29736   31-JUL-2003    China: Hong Kong
18.    GZ50         AY304495.1     29720   05-NOV-2003    China: Hong Kong
19.    AS           AY427439.1     29711   21-OCT-2003    Italy: Milan
20.    Sin2500      AY283794.1     29711   12-AUG-2003    Singapore
21.    Sin2679      AY283796.1     29711   12-AUG-2003    Singapore
22.    Sin2774      AY283798.2     29711   02-OCT-2003    Singapore
23.    Sin2677      AY283795.1     29705   12-AUG-2003    Singapore
24.    Sin2748      AY283797.1     29706   12-AUG-2003    Singapore
25.    BJ01         AY278488.2     29725   01-MAY-2003    China: Beijing
26.    BJ02         AY278487.3     29745   05-JUN-2003    China: Beijing
27.    BJ03         AY278490.3     29740   05-JUN-2003    China: Beijing
28.    BJ04         AY279354.2     29732   05-JUN-2003    China: Beijing
29.    Taiwan TC1   AY338174.1     29573   28-JUL-2003    Taiwan
30.    Taiwan TC2   AY338175.1     29573   28-JUL-2003    Taiwan
31.    Taiwan TC3   AY348314.1     29573   29-JUL-2003    Taiwan
32.    GD01         AY278489.2     29757   18-AUG-2003    China: Beijing
33.    SZ3          AY304486.1     29741   05-NOV-2003    China: Hong Kong
34.    SZ16         AY304488.1     29731   05-NOV-2003    China: Hong Kong
35.    ZJ01         AY297028.1     29715   19-MAY-2003    China: Beijing
36.    ZMY 1        AY351680.1     29749   03-AUG-2003    China: Guangdong
Same structure genomes: TWC3, TWK, TWS, TWY, Urbani and Frankfurt 1 as TWH; HSR 1 as Tor2; CUHK-W1 as CUHK-Su10; Sin2500, Sin2679 and 
Sin2774 as AS; Taiwan TC2 and Taiwan TC3 as Taiwan TC1. 


Figure 1 


Comparison of nucleotide structures of SARS-CoV complete genome isolates. Insertions are denoted as emphasized 
(italic) and >, deletions by a minus sign (-). Positions are given in relation to the TWH isolate. The two isolates with a 
large number of individual insertions (ZJ01, ZMY 1) are given separately, with exact positions of insertions and deletions. 


We investigated 38 isolates of the SARS-CoV complete
genome (two pairs of which were identical), sequenced
and published by October 31st, 2003 (with updated
revisions up to February 20, 2004). Sequences were taken
from the PubMed NCBI Entrez site [9] in gbk and fasta
formats (Table 1). The main goal was twofold: first, to
analyze and compare nucleotide sequences and to identify
SNP positions, insertions and deletions, and second, to
group the isolates according to sequence similarity,
eventually pointing to the phylogeny of SARS-CoV isolates.


According to the length of isolates (insertions and dele- 
tions) and the presence of SNPs, we classified them into 
three main groups with subgroups: "regular" isolates with 
no insertions or deletions (with different numbers of 
SNPs), isolates with "long insertions" and isolates with 
"many individual" insertions and deletions (with differ- 
ent positions of SNPs), which is close to phylogenetic 
analysis results. 


Results and discussion 

Genome polymorphism 

All the sequences are between 29573 and 29757 in length 
(Table 1), with a high degree of similarity (>99% pair- 
wise). Still, they can be differentiated on the basis of 
sequence polymorphism (insertions and deletions), 
number and sites of SNPs [8]. Results of the comparison 
of genome primary structure of the analyzed isolates are 
given in Figure 1. 


Analysis of genomic polymorphism of the isolates
led to the following observations:


I) Some of the isolates are nucleotide-identical or almost
identical. There are two pairs of nucleotide-identical
isolate sequences: TWH and TWC2, and the two Tor2 records
(accession numbers AY274119 and NC_004718). Therefore,
instead of 38, we consider the dataset to contain 36
isolates. Further, the isolate TWC3 differs from TWH in
just one position (see table in additional file 1), which
is about what is expected at random [11]. Isolates
Frankfurt 1 and FRA are identical up to the poly-"a" of
length 13 present at the 3' end of FRA (Figure 1).


II) Similarity analysis showed that a significant number of
isolates have the same length (29727 bases) and the same
beginning and ending subsequences (that seem to be the
exact starts and ends of the complete SARS-CoV genome
up to the poly-"a" at the 3' end), thus forming a kind of
referent group; these are the isolates TWH, TWC3, TWK,
TWS, TWY, Urbani and Frankfurt 1 (Figure 1). The fully
sequenced isolate TWH has then been chosen as the referent
isolate for sequence comparisons, since its average number
of SNPs compared to other isolates is the smallest. For
example, TWH and Urbani have average SNP counts of 15.7
and 17.6, respectively, over all the isolates, and of 5.7
and 10.5, respectively, over the referent group.
For SNPs see the tables in the additional files 1 and 2.
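
The choice of a referent isolate by smallest average SNP count can be sketched as follows. This is a minimal illustration, assuming the sequences have already been brought to equal length (insertions and deletions resolved); the function names `snp_count` and `pick_reference` and the toy sequences are hypothetical and are not taken from the original analysis.

```python
from itertools import combinations


def snp_count(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def pick_reference(isolates: dict[str, str]) -> tuple[str, dict[str, float]]:
    """Return the isolate with the smallest average SNP count to all others."""
    totals = {name: 0 for name in isolates}
    for (n1, s1), (n2, s2) in combinations(isolates.items(), 2):
        d = snp_count(s1, s2)
        totals[n1] += d
        totals[n2] += d
    others = len(isolates) - 1
    averages = {name: total / others for name, total in totals.items()}
    best = min(averages, key=averages.get)
    return best, averages


# Toy data for illustration only (not real SARS-CoV sequences).
toy = {
    "iso_a": "ACGTACGT",
    "iso_b": "ACGTACGA",
    "iso_c": "ACCTACGT",
    "iso_d": "TCGTACGT",
}
reference, averages = pick_reference(toy)
print(reference, averages[reference])
```

In the toy data, `iso_a` differs from every other sequence in one position only, so it plays the role that TWH plays in the dataset.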


III) Most isolates, compared to TWH, are shorter at the
5' end (e.g., Sin2500, Sin2679, Sin2774, Sin2677,
Sin2748, AS), have poly-"a" strings of various lengths at
the 3' end (e.g., Tor2, HSR1, FRA, BJ02, TW1, HKU-39849,
WHU), or both (BJ01, BJ03, BJ04, CUHK-W1, CUHK-Su10).
Three of the isolates, Taiwan TC1, Taiwan TC2 and Taiwan
TC3, have both starting and ending deletions (69
nucleotides at the 5' end, 85 nucleotides at the 3' end).
Several isolates (e.g., TWJ, TWC, Sin2677, Sin2748) have
some short deletions inside the sequence (Figure 1).


IV) There is a group of isolates that have insertions of
significant length (29 nucleotides) inside the sequence.
These are the isolates GD01, SZ3 and SZ16. A significant
number of individual insertions have been identified in the
ZJ01 and ZMY 1 isolates (Figure 1, additional files 3, 4, 5).


Among the SNP contents of isolates, there is a significant 
difference in the number of SNPs for different pairs of iso- 
lates. For TWH as the referent isolate, this number varies 
from 1 to 80 SNPs. Isolates may be classified into three 
groups based on the number of SNPs with TWH (Figure 
2): 


1. with fewer than 15 SNPs (TWC3, TWK, TWS, TWY, Urbani,
TWJ, TWC, TW1, Tor2, HSR1, CUHK-Su10, AS, Sin2500,
Sin2679, Sin2774, Sin2677, Sin2748, Taiwan TC1,
Taiwan TC2, Taiwan TC3, Frankfurt 1, FRA, HKU-39849,
CUHK-W1),


2. with between 15 and 30 SNPs (WHU, GZ50, BJ01-BJ04, ZJ01),


3. with 30 or more SNPs (GD01, SZ3,
SZ16, ZMY 1).
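
The three SNP-count classes can be written as a small function. This is a sketch under one stated assumption: "between 15 and 30" is read as 15 to 29 inclusive, since class 3 starts at 30. The function name `snp_class` is hypothetical; the two counts used in the example (1 for TWC3, 80 for ZMY 1) are the ones quoted in the text.

```python
def snp_class(n_snps: int) -> int:
    """Map the number of SNPs relative to TWH to class 1, 2 or 3."""
    if n_snps < 0:
        raise ValueError("SNP count cannot be negative")
    if n_snps < 15:
        return 1  # fewer than 15 SNPs
    if n_snps < 30:
        return 2  # 15 to 29 SNPs (assumed reading of "between 15 and 30")
    return 3      # 30 or more SNPs


# Counts quoted in the text: TWC3 has 1 SNP, ZMY 1 has 80.
for label, count in {"TWC3": 1, "ZMY 1": 80}.items():
    print(label, snp_class(count))
```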


Finally, besides the number, there are differences in the
positions of SNPs (potential mutation sites). In order to
avoid nucleotide changes that probably arose during
propagation of the virus in cell culture and sequencing,
Figure 3 represents positions (on the relative scale of all
isolates and on the TWH scale) where two or more SNPs
occurred, not taking into consideration isolates with long
insertions (GD01, SZ3 and SZ16). The positions of multiple
SNPs of these three isolates, which are similar among the
three, are highly different from all the others and are
represented in Figure 4. These results coincide with those
published in the paper of Marra et al. [4] for the Urbani
and Tor2 isolates, but differ from those published in
Ruan's paper [8] for the 14 isolates analyzed therein
(Sin-group, BJ-group, Tor2, Urbani, CUHK-W1, HKU-39849,
GD01), which were obviously based on different revisions of
the PubMed NCBI Entrez database [9]; the lengths of the
sequences Tor2, CUHK-W1, GD01 and BJ01-BJ04 differ from
the revisions we analyzed, and consequently so do some
nucleotides and the number of base changes at given
positions. Differences include the following positions
(based upon Urbani and TWH SARS-CoV): 2601 (Tor2 T instead
of C, BJ04 T instead of missing base), 7919 (BJ03 C
instead of T), 8559 (BJ04 T instead of A), 8572 (BJ01 T
instead of G, GD01 G instead of T), 9404 (BJ04 T instead
of missing base), 9479 (BJ04 T instead of missing base),
9854 (BJ04 T instead of missing base), 19838 (GD01 G
instead of A), 21721 (GD01, BJ01 A instead of missing
base, BJ04 G instead of missing base), 22222 (BJ04 C
instead of N), 27243 (GD01 T instead of C, BJ03 T instead
of N), 29279 (all A's). The results obtained also differ
from Hsueh et al. [12] regarding nucleotides in the
HKU-39849 isolate at positions 7746, 9404, 9479, 17564,
17846, 19064, 21721, 22222, 27827.
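
Identifying "multiple SNP" positions, i.e. positions where two or more isolates differ from the referent isolate, can be sketched as below. The sketch assumes sequences already aligned to the reference coordinate scale and skips anything that is not A, C, G or T (such as N or a gap); the function name and the toy sequences are hypothetical.

```python
from collections import defaultdict


def multiple_snp_positions(reference, isolates, min_isolates=2):
    """Return {position: {isolate: base}} for positions (1-based, reference
    scale) where at least `min_isolates` isolates differ from the reference."""
    carriers = defaultdict(dict)
    for name, seq in isolates.items():
        if len(seq) != len(reference):
            raise ValueError(f"{name} is not aligned to the reference")
        for pos, (ref_base, base) in enumerate(zip(reference, seq), start=1):
            # Ignore ambiguous or missing bases on either side.
            if base != ref_base and base in "ACGT" and ref_base in "ACGT":
                carriers[pos][name] = base
    return {
        pos: hits
        for pos, hits in sorted(carriers.items())
        if len(hits) >= min_isolates
    }


# Toy data for illustration only.
ref = "ACGTACGT"
toy = {"iso_x": "ACGTACGA", "iso_y": "ACTTACGA", "iso_z": "ACGTNCGT"}
print(multiple_snp_positions(ref, toy))
```

Position 3 is changed in one toy isolate only and is therefore dropped, which mirrors the idea that individual SNPs may have arisen at random.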


Additional files 1-5 present SNPs for all the isolates
in all five groups, indicating whether they occur in ORFs
or IGRs (for annotated isolates), as well as the number of
SNPs in ORFs and in IGRs per isolate. The total number of
SNPs is 312 (only 2 in IGRs: TWH positions 27812 for the
isolate Taiwan TC3 and 27827 for the isolates BJ01 and
CUHK-W1). The average number of SNPs per isolate is
15.7; TWC3 (just 1 SNP) and ZMY 1 (as many as 80) differ
significantly from the average.


Grouping of isolates 

The isolates from the dataset considered may be classified 
according to their sequence polymorphism and SNP con- 
tents properties just described. At first, properties (III, IV) 
may result in three different groups (Figure 2): 


A. "regular isolates" whose nucleotide structure is close to
the referent group (different 5' and 3' ends, short deletion,
individual insertion): TWH, TWC3, TWK, TWS, TWY,
Urbani, TWJ, TWC, TW1, Tor2, HSR1, CUHK-Su10, AS,
Sin2500, Sin2679, Sin2774, Sin2677, Sin2748, Taiwan
TC1, Taiwan TC2, Taiwan TC3, WHU, Frankfurt 1, FRA,
HKU-39849, CUHK-W1, GZ50 and BJ01-BJ04 (Figure 5, 6a)


B. isolates with "long insertions": GD01, SZ3 and SZ16
(Figure 6b) and


C. isolates with "many individual" insertions: ZJ01 and
ZMY 1 (Figure 7a, 7b).


Figure 2


Structural tree for SARS-CoV isolates. The tree is based on qualitative analysis of sequence variation of 36 isolates.


Further, SNP properties (1-3) may divide the A group into
A1 and A2, and the C group into C1 and C2 subgroups:


A1. TWH, TWC3, TWK, TWS, TWY, Urbani, TWJ, TWC,
TW1, Tor2, HSR1, CUHK-Su10, AS, Sin2500, Sin2679,
Sin2774, Sin2677, Sin2748, Taiwan TC1, Taiwan TC2,
Taiwan TC3, Frankfurt 1, FRA, HKU-39849 and CUHK-W1
(Figure 5)
Protein  Relative scale  TWH scale  TWH  GD01  SZ3  SZ16  AAc changes   AAc position  AAc properties changes
1ab      1209            1206       T    T     C    C     Silent (Asn)
1ab      1912            1909       G    G     T    T     Ala→Ser                     Hp+S+T → P+S+T
1ab      3331            3326       T    T     C    C     Val→Ala                     Hp+S+Ap → Hp+S+T
1ab      3631            3626       T    C     C    C     Ile→Thr       1121          Hp+Ap → Hp+P+S
1ab      3676            3671       C    ?     ?    T     Pro→Leu                     S → Hp+Ap
1ab      5259            5251       C    C     A    A     Leu→Ile                     Hp+Ap → Hp+Ap
1ab      6466            6456       A    A     G    G     Silent
1ab      6622            6612       G    T     T    T     Leu→Phe       2116          Hp+Ap → Hp+Ar
1ab      6939            6929       G    A     A    A     Cys→Tyr       2222          Hp+P+S+T → Hp+P+Ar
1ab      7080            7070       T    T     C    C     Leu→Ser                     Hp+Ap → P+S+T
1ab      8514            8502       T    T     G    G     Cys→Trp                     Hp+P+S+T → Hp+P+Ar
1ab      8571            8559       T    C     C    C     Silent
1ab      9189            9176       T    C     C    C     Val→Ala       2971          Hp+S+Ap → Hp+S+T
1ab      9492            9479       T    C     C    C     Val→Ala       3072          Hp+S+Ap → Hp+S+T
1ab      13881           13862      C    C     T    T     Silent
1ab      20868           20840      G    G     A    A     Silent
1ab      21020           20992      G    G     A    A     Arg→Lys                     P+PCh → Hp+P+PCh
S        22200           22172      C    C     A    A     Asn→Lys                     P+S → Hp+P+PCh
S        22235           22207      C    T     T    T     Ser→Leu       239           P+S+T → Hp+Ap
S        22301           22273      C    C     A    A     Thr→Lys                     Hp+P+S → Hp+P+PCh
S        22544           22517      A    G     G    G     Silent (Arg)
S        22549           22522      A    G     G    G     Lys→Arg                     Hp+P+PCh → P+PCh
S        22598           22570      T    T     C    C     Phe→Ser                     Hp+Ar → P+S+T
S        22957           22928      T    T     A    A     Asn→Lys                     P+S → Hp+P+PCh
S        22980           22951      C    C     G    G     Thr→Ser                     Hp+P+S → P+S+T
S        23339           23310      T    T     C    C     Ser→Pro                     P+S+T → S
S        23514           23485      T    T     C    C     Leu→Ser                     Hp+Ap → P+S+T
S        23622           23593      C    C     T    T     Ser→Leu                     P+S+T → Hp+Ap
S        23747           23718      A    A     G    G     Thr→Ala                     Hp+P+S → Hp+S+T
S        23781           23752      C    C     T    T     Ala→Val                     Hp+S+T → Hp+S+Ap
S        23852           23823      T    G     G    G     Tyr→Asp       778           Hp+P+Ar → P+S+NCh
S        24200           24171      A    A     G    G     Thr→Ala                     Hp+P+S → Hp+S+T
S        24595           24566      T    C     C    C     Silent
S        25007           24978      A    A     G    G     Lys→Glu                     Hp+P+PCh → P+NCh
hyp      25316           25286      T    ?     A    A     Phe→Ile                     Hp+Ar → Hp+Ap
hyp      25538           25508      T    T     A    A     Cys→Ser                     Hp+P+S+T → P+S+T
hyp      25574           25544      C    C     T    T     His→Tyr                     Hp+P+PCh → Hp+P+Ar
hyp      25658           25628      T    T     G    G     Cys→Gly                     Hp+P+S+T → Hp+S+T
M        26440           26410      G    G     A    A     Gly→Ser                     Hp+S+T → P+S+T
M        26507           26477      G    T     T    T     Cys→Phe       27            Hp+P+S+T → Hp+Ar
M        26616           26586      T    T     C    C     Silent
hyp      27858           27827      T    C     C    C     Cys→Arg       17            Hp+P+S+T → P+PCh

Figure 4 


Positions with two or more SNPs in B group with amino acid changes. Only SNPs in B group isolates, regarding 
TWH, have been counted. The same notation is applied as in Figure 3. 
Figure 5


Density distribution of SNPs, insertions and deletions in the isolates of A1 group. SNPs are represented above the 
line, insertions below the line, upward oriented, and deletions below the line, downward oriented. The TWH scale is used. 
The same holds for Figures 6, 7, 8. 


A2. WHU, BJ01-BJ04 and GZ50 (Figure 6a)

C1. ZJ01 (Figure 7a)

C2. ZMY 1 (Figure 7b)


Finally, taking the positions of SNPs into account moves
CUHK-W1 from the A1 into the A2 group (more than 50% of
common SNP positions), while WHU moves from A2 into A1
(less than 30% of common SNP positions), giving the final
grouping of isolates presented as a structural tree (Figure 2):


A1. TWH, TWC3, TWK, TWS, TWY, Urbani, TWJ, TWC,
TW1, Tor2, HSR1, CUHK-Su10, AS, Sin2500, Sin2679,
Sin2774, Sin2677, Sin2748, Taiwan TC1, TC2, TC3,
Frankfurt 1, FRA, HKU-39849 and WHU (Figure 5 and the
additional file 1)


A2. CUHK-W1, GZ50 and BJ01-BJ04 (Figure 6a and the
additional file 2)


B. GD01, SZ3 and SZ16 (Figure 6b and the additional file
3)


C1. ZJ01 (Figure 7a and the additional file 4)

C2. ZMY 1 (Figure 7b and the additional file 5).
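
The regrouping by shared SNP positions can be illustrated with a small sketch. The text does not define exactly how the percentage of common SNP positions was computed, so the definition used here (the share of an isolate's SNP positions that also occur among the SNP positions of the A2 group) is an assumption, as are the function names and the toy position sets. Only the 50% and 30% thresholds come from the text.

```python
def shared_fraction(isolate_snps: set[int], group_snps: set[int]) -> float:
    """Fraction of an isolate's SNP positions that also occur in the group."""
    if not isolate_snps:
        return 0.0
    return len(isolate_snps & group_snps) / len(isolate_snps)


def suggest_group(isolate_snps: set[int], a2_snps: set[int], current: str) -> str:
    """Apply the 50% / 30% thresholds quoted in the text (assumed definition)."""
    fraction = shared_fraction(isolate_snps, a2_snps)
    if current == "A1" and fraction > 0.5:
        return "A2"
    if current == "A2" and fraction < 0.3:
        return "A1"
    return current


# Hypothetical SNP position sets, for illustration only.
a2_positions = {100, 200, 300, 400}
print(suggest_group({100, 200, 300, 999}, a2_positions, "A1"))  # shares 75%
print(suggest_group({100, 555, 666, 777}, a2_positions, "A2"))  # shares 25%
```
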

Although qualitative in nature, the structural tree turns
out to be close to the quantitative grouping which is a
basis for (computational) phylogenetic classification.
Tables in additional files 1-5 represent SNPs, insertions
and deletions in groups A-C (see additional file 1 for
isolates of the A1 group, on the relative and TWH scales;
additional file 2 for isolates of the A2 group and TWH, on
the TWH scale; additional file 3 for isolates of group B
with TWH; and additional files 4 and 5 for isolates of the
C1 and C2 groups, respectively). Figures 5, 6 and 7
represent the density distribution of SNPs, insertions and
deletions on the TWH scale, for the same groups of
isolates. Figure 8 represents the overall density
distribution of SNPs, insertions and deletions for all the
36 isolates, along with the gene map for TWH (which is
quite similar to gene maps of other isolates). Given the
number of available sequences, the density distributions do
not yet show regularities that could provide for precise
statistical characterization. Still, they exhibit crowding
regions close to the 3' end, which is also characterized by
the presence of a number of proteins of unknown function.


Figure 6


(a and b). Density distribution of SNPs, insertions and deletions in the isolates of A2 and B groups. In A2 group 
there are no insertions / deletions. In B group there are large insertions in GD01, SZ3 and SZ16 isolates. 
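
A density distribution of this kind can be approximated by counting events in fixed-width bins along the TWH scale. The following is a minimal sketch: the bin width of 1000 nucleotides and the function name are assumptions, the genome length 29727 is the TWH length from Table 1, and the positions are a few B-group SNP positions on the TWH scale taken from Figure 4.

```python
def density(positions, genome_length, bin_size=1000):
    """Count events (e.g. SNP positions, 1-based) per bin along the genome."""
    n_bins = -(-genome_length // bin_size)  # ceiling division
    counts = [0] * n_bins
    for p in positions:
        if not 1 <= p <= genome_length:
            raise ValueError(f"position {p} is outside the genome")
        counts[(p - 1) // bin_size] += 1
    return counts


twh_length = 29727
snp_positions = [1206, 1909, 22172, 22207, 22273, 27827]
counts = density(snp_positions, twh_length)
for i, c in enumerate(counts):
    if c:
        print(f"{i * 1000 + 1}-{min((i + 1) * 1000, twh_length)}: {c}")
```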


It can also be noted that the proposed grouping of 36
isolates, based on different criteria, still conserves the
previous classification T-T-T-T / C-G-C-C [8]. All the
isolates from groups A1 and C have the T-T-T-T
configuration, while all the isolates from groups A2 and B
have the C-G-C-C configuration, except for GZ50 and BJ04,
which are T-G-C-C (Figure 2, Figure 9). The two sequence
variants correspond to the epidemiological spread: those
that originated in the Hotel M in Hong Kong have the
T-T-T-T configuration (groups A1 and C in our
classification): Canada (Tor2), Singapore (all Sins),
Frankfurt, Taiwan, Hong Kong (HKU-39849), Hanoi (Urbani),
Italy (HSR1), China (ZJ01), etc.; the others have the
C-G-C-C configuration (A2 and B in our classification) and
originated in Guangdong, China (GD01 and GZ50), Hong Kong
(CUHK-W1, SZ3 and SZ16) and Beijing (BJ01-BJ04). The fact
that the enlarged number of isolates exhibits the same
properties relating to the four loci supports the
assumption that the mutations could not have arisen by
chance base substitution during propagation in cell
culture and the sequencing procedure [8].
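
Reading off such a four-locus configuration from a sequence is a simple lookup. In the sketch below the locus positions are passed in as a parameter; the positions and sequences used in the example are hypothetical placeholders, not the four loci of [8], and the function name is an assumption.

```python
def locus_configuration(sequence: str, loci: tuple[int, ...]) -> str:
    """Return the bases at the given 1-based positions, joined as 'X-X-X-X'."""
    for p in loci:
        if not 1 <= p <= len(sequence):
            raise ValueError(f"locus {p} is outside the sequence")
    return "-".join(sequence[p - 1] for p in loci)


# Hypothetical loci and toy sequences, for illustration only.
toy_loci = (2, 4, 6, 8)
print(locus_configuration("ATGTCTAT", toy_loci))
print(locus_configuration("ACGGCCAC", toy_loci))
```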


Changes in amino acids 

We analyzed amino acid changes in proteins for the
annotated isolates (19 out of 36), and in presumed proteins
in non-annotated ones, for multiple SNPs in all the
isolates. Results of the analysis are represented in
Figures 3 and 4. Figure 3 shows that silent mutations
occurred in envelope protein E, while nucleotide changes
resulted in amino acid changes in spike (S), membrane (M)
and nucleocapsid (N) proteins. All three SNPs in the spike
protein are situated in the outer membrane region and not
within the potential epitope region (amino acid positions
469-882) as proposed by Ren Y. et al. [13]. Amino acid
changes occurred in two multiple SNPs in M protein, one
multiple SNP in N protein and 7 (out of 13) multiple
SNPs of the polyprotein 1ab, as well as in one multiple
SNP of a hypothetical protein, while the silent mutations
occurred in three hypothetical proteins. Figure 3 also
represents properties of the corresponding amino acids
resulting from SNPs. The only significant changes in amino
acid properties are in S protein, Gly→Asp (A2, B groups,
i.e., in CUHK-W1, GZ50, BJ01-BJ03, GD01, SZ3 and SZ16
isolates), and in a hypothetical protein, Cys→Arg (the same
isolates, BJ04 in addition). The only addition in
non-annotated sequences is in the hypothetical protein
following S protein in TWH, exhibiting a silent change, and
in non-annotated BJ02 and BJ03, corresponding to the
hypothetical protein, Gly→Glu. A similar analysis can be
done for amino acid changes corresponding to SNPs at
positions specific for B group isolates (Figure 4). Taking
into account the only annotated isolate GD01, there are
five amino acid changes in polyprotein 1ab, two amino acid
properties changes in S protein (Ser→Leu and Tyr→Asp,
the second being within the epitope region), one amino
acid change in M protein and one amino acid property
change (Cys→Arg) in BGI-PUP.


Figure 7


(a and b). Density distribution of SNPs, insertions and deletions in the isolates of C1, C2 groups.


Figure 8


The overall density distribution of SNPs, insertions and deletions along with the gene map for TWH.


Figure 9


Phylogenetic tree of 36 SARS-CoV complete genome isolates. Distances represent degree of sequence variation. The 
largest distance is associated with ZMY 1, followed by ZJ01 isolate (groups C1, C2). Groups A1, A2 and B are clearly 
distinguished. The tree has been obtained using CLUSTALW and PhyloDraw programs. 
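
Whether a SNP is silent or changes an amino acid can be decided by translating the affected codon before and after the substitution. The sketch below uses the standard genetic code; the function name is an assumption, and the codons in the example were chosen only to be consistent with changes named in the text (Tyr→Asp, Cys→Arg, a silent Asn change). They are not the actual codons of any isolate.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_snp(codon: str, offset: int, new_base: str) -> tuple[str, str, bool]:
    """Return (amino acid before, amino acid after, is_silent) for a
    substitution at `offset` (0, 1 or 2) within `codon`."""
    if len(codon) != 3 or offset not in (0, 1, 2):
        raise ValueError("expected a 3-base codon and an offset of 0, 1 or 2")
    mutated = codon[:offset] + new_base + codon[offset + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    return before, after, before == after


# Illustrative codons only (assumed, not taken from the genomes).
print(classify_snp("TAT", 0, "G"))  # Tyr -> Asp
print(classify_snp("TGT", 0, "C"))  # Cys -> Arg
print(classify_snp("AAT", 2, "C"))  # Asn -> Asn, silent
```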


Phylogenetic analysis 

The SARS-CoV isolates have been multialigned using the
CLUSTALW program [10] as the very first step in
obtaining a phylogenetic tree. The aligned sequences were
then submitted to CLUSTALW for bootstrapping and
phylogenetic tree production. Enlargement of the sequence
set resulted in the refinement of the phylogenetic tree
produced, as compared to previous results such as Ruan [8]
and Zhang and Zheng [14], obtained for 14 and 16 isolates,
respectively. The phylogenetic tree obtained, drawn using
the PhyloDraw program [15], is represented in Figure 9. It
is similar to our structural tree based on qualitative
analysis of the isolates (Figure 2).


The results of the analysis of dissimilarities, described in 
previous paragraphs, are in accordance with the 
alignment obtained by CLUSTALW, but regrouped and 
formatted in a way that facilitates further interpretation 
and application. 


Conclusion 
Comparative analysis of genome sequence variations of 
38 SARS-CoV isolates resulted in some conclusions that 
might be of interest in further investigation of the SARS- 
CoV genome: 


1. All of the SARS-CoV isolates are highly homologous 
(more than 99% pairwise). Most of them have similar 
nucleotide structure, with the same 5' and 3' ends and 
poly-"a" at the 3' end of different length (0-24), some of 
them with a single short deletion close to the 3' end of the 
sequence; out of 312 SNPs in total, only two are in IGRs. 


2. Three of the 38 isolates have long insertions within the
sequence;


3. Two of the isolates have a large number of individual
insertions / deletions, exhibiting different SNP positions;


4. All the isolates may be grouped according to sequence 
polymorphism into three groups (with up to two sub- 
groups), reflecting their similarities / dissimilarities. Since 
the isolate sequences have a high degree of homology, 
different properties of groups are represented in a more 
transparent way in the classification tree obtained by such 
a qualitative analysis, than in a bootstrapped phylogenetic 
tree obtained from multialigned sequences using the 
CLUSTALW program [10]. 


5. The total number of amino acid changes caused by mul- 
tiple SNPs is 15 (in isolates of A, C groups) and 34 in 
isolates of B group. The total number of silent mutations 
is 10 (for A, C groups) and 7 (for B group). 


6. Since S protein is of special interest regarding its
receptor affinity and antigenicity, it is interesting to
notice that all changes in amino acid properties are
located in its outer membrane region, one for A, C groups
and two for B group.


7. The results obtained may be useful in further investiga- 
tion aiming at identification of SARS-CoV genome regions 
responsible for its infectious nature. 


Methods 

Dataset 

We investigated the complete genomes of 38 SARS-CoV
isolates. Nucleotide sequences were taken from the PubMed
NCBI Entrez database [9] in gbk and fasta formats (Table 1).


The coverage included all the isolates published by
October 31st, 2003 (with updated revisions). The
identifiers, accession numbers, genomic size (in
nucleotides), revision dates and country or source of the
isolates considered are included in the table, together
with the labels by which they are referred to in this
paper. The fully sequenced isolate TWH has been chosen as
the referent isolate, since its average number of SNPs was
the lowest as compared to all other isolates.


Methods for similarity analysis 
For similarity analysis of isolates, the following two-step
procedure has been applied:


1. identification of structurally identical parts of isolates,
i.e., insertion and deletion sites


2. identification of SNPs in structurally identical parts.


Step 1 has been carried out by a function performing
similarity analysis of subsequences of a given length
(e.g., 100 bp), and identifying significantly non-matching
strings as being inserted in the corresponding sequence
(i.e., deleted from the other). Since a significant number
of isolates have the same length (29727 bases) and the same
starting and ending subsequences (that seem to be the exact
starts and ends of the complete SARS-CoV genome up to the
poly-"a" at the 3' end), they may be considered as forming
a representative group. The nucleotide structure of all
other isolates was analyzed with respect to this
representative group. For each pair of isolates (x,y) (x
from the representative group), a file InsDelx-y has been
produced containing positions and lengths of each of the
insertions or deletions in the isolate y.
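
The original function for step 1 is not listed in the paper. The following is a simplified sketch of the same idea, under stated assumptions: it scans the two sequences base by base and, at each mismatch, uses a window comparison to decide whether the mismatch is a substitution or the start of an insertion or deletion in y. The names `similar` and `find_indels`, the parameters `max_mismatch` and `max_shift`, and the tuple output format are all hypothetical; only the notion of a fixed comparison window (e.g., 100 bp) comes from the text.

```python
def similar(a: str, b: str, max_mismatch: float = 0.1) -> bool:
    """True if the overlapping part of two windows matches closely enough."""
    n = min(len(a), len(b))
    if n == 0:
        return len(a) == len(b)
    mismatches = sum(1 for p, q in zip(a, b) if p != q)
    return mismatches / n <= max_mismatch


def find_indels(x: str, y: str, window: int = 100, max_shift: int = 50):
    """Return a list of (kind, x_position, length) events describing y
    relative to x; x_position is 1-based on the x scale."""
    events = []
    i = j = 0
    while i < len(x) and j < len(y):
        if x[i] == y[j]:
            i += 1
            j += 1
            continue
        # A mismatch followed by matching windows is treated as a substitution.
        if similar(x[i + 1:i + 1 + window], y[j + 1:j + 1 + window]):
            i += 1
            j += 1
            continue
        for k in range(1, max_shift + 1):
            if similar(x[i:i + window], y[j + k:j + k + window]):
                events.append(("insertion", i + 1, k))  # k extra bases in y
                j += k
                break
            if similar(x[i + k:i + k + window], y[j:j + window]):
                events.append(("deletion", i + 1, k))   # k bases missing in y
                i += k
                break
        else:
            # Nothing re-synchronizes within max_shift: skip one base.
            i += 1
            j += 1
    return events


# Toy sequences for illustration only.
x = "GATTACAGGCTCAAGTCCGTATGCAATCGGCTAAGCTTGCA"
y_del = x[:10] + x[13:]          # three bases removed after position 10
y_ins = x[:20] + "GGG" + x[20:]  # three bases inserted after position 20
print(find_indels(x, y_del, window=8, max_shift=10))
print(find_indels(x, y_ins, window=8, max_shift=10))
```

The returned events correspond to what the text describes as the content of a file InsDelx-y, namely positions and lengths of insertions or deletions in y. Repetitive regions can make the exact boundary of an event ambiguous, which this sketch does not attempt to resolve.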


Step 2 has been carried out by comparing structurally 
identical parts (of the same length) of pairs of isolates. 
The starting and ending positions of those parts have been 
taken from the file InsDelx-y (for comparison of x and y), 
produced in step 1. The procedure returns results in a file 
with SNPs in the two sequences (files Mismx-y). 
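
Step 2 can be sketched as a position-by-position comparison of the structurally identical parts. The format of the InsDelx-y file is not given in the paper, so the segment representation used here, a list of (x_start, y_start, length) triples with 1-based starts, is an assumption, as are the function name and the toy sequences.

```python
def find_snps(x: str, y: str, segments):
    """List SNPs as (x_position, y_position, x_base, y_base) within the
    given structurally identical segments (1-based start coordinates)."""
    snps = []
    for x_start, y_start, length in segments:
        for k in range(length):
            a = x[x_start - 1 + k]
            b = y[y_start - 1 + k]
            if a != b:
                snps.append((x_start + k, y_start + k, a, b))
    return snps


# Toy example: y carries a two-base insertion after x position 6,
# so the second segment starts at x position 7 and y position 9.
x_seq = "ACGTACGTAC"
y_seq = "ACGAACTTGTCC"
segments = [(1, 1, 6), (7, 9, 4)]
print(find_snps(x_seq, y_seq, segments))
```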


We also used the CLUSTALW program [10] for multialign- 
ment as a control process, as well as for phylogenetic 
investigations. 


Methods for phylogenetic investigation 

In order to use similarity analysis results for drawing any 
phylogenetic conclusions about the SARS-CoV genome 
dataset, a CLUSTALW [10] multialigned output has been 
generated and a bootstrapped phylogenetic tree has been 
produced and drawn using the PhyloDraw program [15]. 
Additional material


Additional File 1 

Positions of SNPs in A1 group. Positions are given on the relative and 
TWH scales. IDs of annotated isolates are in grey boxes; SNPs in ORFs 
(or corresponding to those in ORFs, for non-annotated isolates) are in red 
bold and SNPs in IGRs in blue bold. The total number of SNPs per isolate 
is given at the bottom, as well as the number of SNPs in ORFs and IGRs for 
annotated isolates. A minus sign (-) denotes deletion. 

[http://www.biomedcentral.com/content/supplementary/1471-2105-5-65-S1.xls]


Additional File 2 

Positions of SNPs in A2 group. Positions are given on the TWH scale. The 
same notation is applied as in the additional file 1. 

[http://www.biomedcentral.com/content/supplementary/1471-2105-5-65-S2.xls]


Additional File 3 

Positions of SNPs and insertions in B group. The exact positions on all four 
scales (TWH, GD01, SZ3 and SZ16) are given. ID of the only annotated 
isolate (GD01) is in grey box; SNPs in ORFs (or corresponding to those 
in ORFs, for non-annotated isolates) are in red bold. The total number of 
SNPs per isolate is given at the bottom, as well as the number of SNPs in 
ORFs and IGRs for the annotated isolate. Small letters denote insertion and 
a minus sign (-) denotes the corresponding deletion. 

[http://www.biomedcentral.com/content/supplementary/1471-2105-5-65-S3.xls]


Additional File 4 

Positions of SNPs, insertions and deletions in C1 group. Positions of 
SNPs, insertions and deletions on both TWH and ZJ01 scales are given. 
The total number of SNPs is given. SNPs are in red bold. A minus sign (-) 
denotes deletion (insertion). 

[http://www.biomedcentral.com/content/supplementary/1471- 